
feat(server): New TTL system, enforce max queue length limits, lazy waitpoint creation #2980

Open
ericallam wants to merge 19 commits into main from ea-branch-117

Conversation


@ericallam ericallam commented Jan 30, 2026

This PR implements a new run TTL system and queue size limits to prevent unbounded queue growth, which should help avoid "death spiral" situations where a queue can never catch up.

The main way to combat this situation is to enforce a maximum TTL on all runs (e.g. up to 14 days): runs that have been queued for longer than that maximum TTL are auto-expired, making room for newer runs to execute. This required creating a new TTL system that can handle higher workloads and is now deeply integrated into the RunQueue. When a run is enqueued with a TTL, it is added to its normal queue as well as to the TTL queue. When a run is dequeued, it is removed from both its normal queue and the TTL queue. If a run is dequeued by the TTL system, it is also removed from its normal queue. Both of these dequeues happen automatically, so there is no race condition.
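
As a rough illustration of the dual-queue membership described above, here is a minimal in-memory sketch; the class and method names (TtlAwareQueue, enqueue, dequeue, expireDueRuns) are hypothetical and not the actual RunQueue API, which implements these steps in Redis with Lua scripts:

```ts
// Hypothetical in-memory sketch of dual-queue membership; the real RunQueue
// does this in Redis with sorted sets and Lua scripts for atomicity.
type RunId = string;

class TtlAwareQueue {
  // Normal FIFO queue per queue name.
  private queues = new Map<string, RunId[]>();
  // TTL queue: runId -> expiration timestamp (ms since epoch).
  private ttlQueue = new Map<RunId, number>();
  // Remember which queue each run lives in so TTL expiry can remove it.
  private runQueueName = new Map<RunId, string>();

  enqueue(queueName: string, runId: RunId, ttlMs?: number) {
    const queue = this.queues.get(queueName) ?? [];
    queue.push(runId);
    this.queues.set(queueName, queue);
    this.runQueueName.set(runId, queueName);
    // If the run has a TTL, it is *also* added to the TTL queue.
    if (ttlMs !== undefined) {
      this.ttlQueue.set(runId, Date.now() + ttlMs);
    }
  }

  // Normal dequeue: remove the run from both its queue and the TTL queue.
  dequeue(queueName: string): RunId | undefined {
    const runId = this.queues.get(queueName)?.shift();
    if (runId !== undefined) {
      this.ttlQueue.delete(runId);
      this.runQueueName.delete(runId);
    }
    return runId;
  }

  // TTL dequeue: expire runs whose TTL has elapsed, removing them from their
  // normal queue as well so they can never be picked up for execution.
  expireDueRuns(now = Date.now()): RunId[] {
    const expired: RunId[] = [];
    for (const [runId, expiresAt] of this.ttlQueue) {
      if (expiresAt <= now) {
        this.ttlQueue.delete(runId);
        const queueName = this.runQueueName.get(runId);
        if (queueName) {
          const queue = this.queues.get(queueName) ?? [];
          this.queues.set(queueName, queue.filter((id) => id !== runId));
          this.runQueueName.delete(runId);
        }
        expired.push(runId);
      }
    }
    return expired;
  }
}
```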

The TTL expiration system is also made reliable by expiring runs via a Redis worker, which is enqueued atomically inside the TTL dequeue Lua script.
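
A sketch of how an expiration consumer could hand expired runs to a worker, assuming the TTL queue exposes something like the expireDueRuns method from the sketch above (runTtlConsumer and the job shape are hypothetical, not the real worker API):

```ts
// Hypothetical TTL consumer loop. In the real system, removing a run from the
// TTL queue and enqueueing its expiration job happen atomically inside a
// single Redis Lua script, so an expired run cannot be lost between steps.
interface TtlQueue {
  // Returns the ids of runs whose TTL has elapsed, removing them from both
  // the TTL queue and their normal queue.
  expireDueRuns(now?: number): string[];
}

async function runTtlConsumer(
  queue: TtlQueue,
  enqueueWorkerJob: (job: { type: "expireRun"; runId: string }) => Promise<void>,
  intervalMs = 1_000
) {
  while (true) {
    for (const runId of queue.expireDueRuns()) {
      // The worker job marks the run as expired and records the reason.
      await enqueueWorkerJob({ type: "expireRun", runId });
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}
```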

Optional associated waitpoints

Additionally, this PR implements an optimization: runs that aren't triggered with a dependent parent run no longer create an associated waitpoint. The associated waitpoint is lazily created if a dependent run later wants to wait for the child run (via debounce or idempotency), which is rare but possible. This means fewer waitpoint creations, and also fewer waitpoint completions for runs with no dependencies.
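
A minimal sketch of the lazy get-or-create idea; getOrCreateAssociatedWaitpoint, blockParentOnChild, and the Waitpoint shape are hypothetical names used only for illustration:

```ts
import { randomUUID } from "node:crypto";

// Hypothetical lazy get-or-create for associated waitpoints. Runs triggered
// without a dependent parent skip waitpoint creation entirely; if a parent
// later wants to wait on the run (e.g. via debounce or idempotency reuse),
// the waitpoint is created on demand.
type Waitpoint = { id: string; runId: string };

const waitpointsByRunId = new Map<string, Waitpoint>();

function getOrCreateAssociatedWaitpoint(runId: string): Waitpoint {
  const existing = waitpointsByRunId.get(runId);
  if (existing) return existing;

  const created: Waitpoint = { id: `waitpoint_${randomUUID()}`, runId };
  waitpointsByRunId.set(runId, created);
  return created;
}

// Called when a parent run wants to wait for an already-triggered child run.
function blockParentOnChild(parentRunId: string, childRunId: string): Waitpoint {
  const waitpoint = getOrCreateAssociatedWaitpoint(childRunId);
  console.log(`blocking ${parentRunId} on ${waitpoint.id}`);
  return waitpoint;
}
```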

Environment Queue Limits

Prevents any single queue from growing too large by enforcing queue size limits at trigger time.

  • Queue size checks happen at trigger time: runs are rejected if the queue would exceed its limit (see the sketch after this list)
  • Dashboard UI shows queue limits on both the Queues page and a new Limits page
  • In-memory caching for queue size checks to reduce Redis load
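
As a rough illustration of the trigger-time check and the in-memory caching mentioned above, here is a hypothetical sketch; checkQueueSizeLimit, the 5-second cache TTL, and the fetcher signature are assumptions, not the actual v3/queueLimits implementation:

```ts
// Hypothetical trigger-time queue limit check with a small in-memory cache
// so repeated triggers don't hit Redis for every single length lookup.
type QueueLengthFetcher = (queueName: string) => Promise<number>;

const cache = new Map<string, { length: number; fetchedAt: number }>();
const CACHE_TTL_MS = 5_000; // assumption: cache queue lengths briefly

async function checkQueueSizeLimit(
  queueName: string,
  limit: number,
  fetchQueueLength: QueueLengthFetcher
): Promise<{ allowed: boolean; currentLength: number }> {
  const cached = cache.get(queueName);
  let length: number;

  if (cached && Date.now() - cached.fetchedAt < CACHE_TTL_MS) {
    length = cached.length;
  } else {
    length = await fetchQueueLength(queueName);
    cache.set(queueName, { length, fetchedAt: Date.now() });
  }

  // Reject the trigger if adding one more run would exceed the limit.
  return { allowed: length + 1 <= limit, currentLength: length };
}
```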

Batch trigger fixes

Currently, when a batch item cannot be created for whatever reason (e.g. queue limits), the run never gets created, which means a stalled run when using batchTriggerAndWait. We've updated the system to handle this differently: when a batch item cannot be triggered and converted into a run, we will eventually (after retrying 8 times, up to 30s) create a "pre-failed" run with the error details, correctly resolving the batchTriggerAndWait.
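
A sketch of the retry-then-pre-fail flow for a single batch item; processBatchItem, createPreFailedRun, and the backoff delays are hypothetical stand-ins for the actual batch queue retry logic and TriggerFailedTaskService:

```ts
// Hypothetical per-item retry for batch processing. If an item still cannot be
// turned into a run after the final attempt, a "pre-failed" run is created with
// the error details so batchTriggerAndWait resolves instead of stalling.
const MAX_ATTEMPTS = 8;
const MAX_DELAY_MS = 8_000; // assumption: cap chosen so total backoff stays under ~30s

async function processBatchItem(
  item: { id: string },
  triggerRun: (itemId: string) => Promise<string>,
  createPreFailedRun: (itemId: string, error: unknown) => Promise<string>
): Promise<string> {
  let lastError: unknown;
  let delayMs = 250; // assumption: exponential backoff starting at 250ms

  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return await triggerRun(item.id); // returns the created run id
    } catch (error) {
      lastError = error;
      if (attempt < MAX_ATTEMPTS) {
        await new Promise((resolve) => setTimeout(resolve, delayMs));
        delayMs = Math.min(delayMs * 2, MAX_DELAY_MS);
      }
    }
  }

  // All attempts failed: create a "pre-failed" run so the batch can complete.
  return createPreFailedRun(item.id, lastError);
}
```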

@changeset-bot

changeset-bot bot commented Jan 30, 2026

⚠️ No Changeset found

Latest commit: 9d90dd8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types



coderabbitai bot commented Jan 30, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

  • Centralizes queue-size logic (new v3/queueLimits utility and environment queueSizeLimit exposure) and adds an LRU cache for environment queue lengths.
  • Refactors queue validation to per-queue semantics (resolveQueueNamesForBatchItems, validateMultipleQueueLimits) and surfaces itemsSkipped/runCount through batch streaming APIs.
  • Introduces per-item retry for batch queue processing, batch-run-count updates, and a TriggerFailedTaskService for creating pre-failed runs.
  • Adds a TTL expiration subsystem (batched TTL consumers, Redis TTL scripts, ttlSystem callback) and lazy get-or-create waitpoints with related waitpoint APIs.
  • Numerous RunEngine/RunQueue/BatchQueue public API additions; tests updated; UI presenters and routes updated to use the single queueSize quota.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~180 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 28.57%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check: ✅ Passed. The pull request title accurately summarizes the main changes: new TTL system, queue size limit enforcement, and lazy waitpoint creation.
  • Description check: ✅ Passed. The PR description provides comprehensive context on the TTL system, queue size limits, lazy waitpoints, and batch trigger improvements, with rationale and objectives.





@ericallam ericallam changed the title feat(dashboard): Display environment queue length limits on queues and limits page feat(server): New TTL system, enforce max queue length limits, lazy waitpoint creation Feb 5, 2026

@ericallam ericallam marked this pull request as ready for review February 11, 2026 14:50

@devin-ai-integration devin-ai-integration bot left a comment

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no bugs or issues to report.

Open in Devin Review

